Human and automatic speech recognition in the presence of speech-intrinsic variations
Abstract
Despite several decades of research, automatic speech recognition (ASR) still falls short of the performance achieved by human listeners. One of the major challenges in ASR is coping with the immense variability of spoken language, which can be categorized into extrinsic sources (e.g., additive noise) and intrinsic factors (such as speaking rate, style, effort, dialect, and accent). What can we learn from the biological blueprint, and which cues important in human speech recognition (HSR) should be considered to improve ASR performance? This thesis aims to answer these questions by comparing HSR and ASR performance and, based on the results, to suggest an alternative approach to feature extraction that improves ASR. The comparison is based on the Oldenburg Logatome Corpus, a database of simple nonsense words consisting of phoneme triplets that covers the intrinsic variations mentioned above. The man-machine gap in terms of the signal-to-noise ratio (SNR) was estimated to be 15 dB, i.e., the masking level in ASR has to be lowered by 15 dB to achieve the same performance as human listeners. The contributions to this gap could be attributed to the individual processing steps of the ASR system: feature extraction caused an SNR-equivalent information loss of 10 dB, while suboptimal classification accounted for the remaining 5 dB of the overall gap. Moreover, the analysis of intrinsic variations showed that human listeners are superior to ASR systems in exploiting temporal cues. These findings motivated the use of spectro-temporal Gabor features in ASR, which were found to exhibit increased robustness against a wide range of noise types. In the presence of intrinsic variations of speech, Gabor features increase the overall performance with respect to several factors (such as speaking effort and style), which suggests incorporating both spectro-temporal and temporal cues in future ASR systems.
Similar resources
A Database for Automatic Persian Speech Emotion Recognition: Collection, Processing and Evaluation
Recent developments in robotics automation have motivated researchers to improve the efficiency of interactive systems by making man-machine interaction more natural. Since speech is the most popular method of communication, recognizing human emotions from the speech signal has become a challenging research topic known as Speech Emotion Recognition (SER). In this study, we propose a Persian em...
Designing and implementing a system for Automatic recognition of Persian letters by Lip-reading using image processing methods
For many years, speech has been the most natural and efficient means of information exchange for human beings. With the advancement of technology and the prevalence of computer usage, the design and production of speech recognition systems have attracted the attention of researchers. Among these, lip-reading techniques have encountered many challenges for speech recognition, that one of the challenges b...
Persian Phone Recognition Using Acoustic Landmarks and Neural Network-based variability compensation methods
Speech recognition is a subfield of artificial intelligence that develops technologies to convert speech utterances into transcriptions. So far, various methods such as hidden Markov models and artificial neural networks have been used to develop speech recognition systems. In most of these systems, the speech signal frames are processed uniformly, while the information is not evenly distributed ...
Convolutional Neural Network with Adaptive Windows for Speech Recognition
Although speech recognition systems are widely used and their accuracy continues to improve, there is a considerable gap between their performance and human recognition ability. This is partially due to high speaker variation in the speech signal. Deep neural networks are among the best tools for acoustic modeling. Recently, using hybrid deep neural network and hidden Markov mo...
A Binaural Microscopic Model Based on a Modulation Filter Bank for Predicting Speech Intelligibility in Normal-Hearing Listeners
In this study, a binaural microscopic model for the prediction of speech intelligibility based on the modulation filter bank is introduced. So far, spectral criteria such as the STI and SII or other analytical methods have been used in binaural models to determine binaural intelligibility. In the proposed model, unlike all models of binaural intelligibility prediction, an automatic ...
Speech Emotion Recognition Based on Power Normalized Cepstral Coefficients in Noisy Conditions
Automatic recognition of emotional states from speech in noisy conditions has become an important research topic in emotional speech recognition in recent years. This paper considers the recognition of emotional states via speech in real environments. For this task, we employ the power normalized cepstral coefficients (PNCC) in a speech emotion recognition system. We investigate its perfor...